This project aims to enhance workplace safety in industrial and construction environments through a real-time monitoring system. The system will leverage advanced computer vision techniques, such as YOLO, to ensure that workers are wearing personal protective equipment (PPE) and staying safe around hazardous tools and areas. The project has the following three phases planned:

The workflow begins with an image frame input, typically sourced from a camera. This frame is processed by YOLOv11, which performs two key tasks: object detection (identifying persons and PPE elements like helmets or vests) and pose estimation (identifying keypoints on the human body). The output includes the location and classification of detected objects, along with pose information.
Next, the system generates a normalized depth map using MiDaS, a state-of-the-art depth estimation model. MiDaS uses an encoder-decoder architecture (e.g., a BEiT Encoder followed by a Decoder) to produce a relative depth map from a single RGB frame. However, this depth map is not in real-world units.
To address this, a calibration step is performed. The system detects a known reference marker, such as an AprilTag, placed at a known distance from the camera. By comparing the AprilTag's real-world depth to the MiDaS-generated normalized depth values, a real-depth conversion factor is calculated. Applying this factor transforms the normalized depth map into approximate real distances. After that, keypoints extracted from YOLO pose are used to determine the depth of each target.
Finally, the compliance check examines whether the detected person is wearing the required PPE and is outside of a defined “danger zone.” If someone is within a hazardous distance without proper PPE, the system triggers an alarm as a warning.
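The per-frame decision logic described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Worker` fields stand in for the outputs of the YOLO detector and the calibrated MiDaS depth map developed in the later sections.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    has_helmet: bool
    has_vest: bool
    depth_m: float  # calibrated distance to the camera, in metres

def frame_alerts(workers, danger_depth_m=1.5):
    """Return indices of workers who are non-compliant AND inside the danger zone."""
    alerts = []
    for i, w in enumerate(workers):
        compliant = w.has_helmet and w.has_vest
        if not compliant and w.depth_m <= danger_depth_m:
            alerts.append(i)  # this worker should trigger the alarm
    return alerts
```

A compliant worker never triggers the alarm; a non-compliant worker only does so once their estimated depth falls below the danger threshold.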
!pip install roboflow
from roboflow import Roboflow
rf = Roboflow(api_key="Sj73sOlgWZ0H0AoSX1mk")
project = rf.workspace("ai-project-yolo").project("ppe-detection-q897z")
version = project.version(25)
dataset = version.download("yolov11")
import cv2
import matplotlib.pyplot as plt
import os

image_path = "../dataset/ppe/train/images"
label_path = "../dataset/ppe/train/labels"

# Function to plot images with bounding boxes
def load_image_with_labels(image_file):
    # Load image
    img = cv2.imread(os.path.join(image_path, image_file))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # Load corresponding label file
    label_file = image_file.replace(".jpg", ".txt")
    label_file_path = os.path.join(label_path, label_file)
    if os.path.exists(label_file_path):
        # Read label data and draw bounding boxes
        with open(label_file_path, "r") as f:
            for line in f:
                # Parse YOLO format (class x_center y_center width height)
                parts = line.strip().split()
                class_id, x_center, y_center, width, height = map(float, parts)
                # Convert YOLO format to bounding box coordinates
                img_h, img_w = img.shape[:2]
                x_center, y_center, width, height = (
                    x_center * img_w, y_center * img_h, width * img_w, height * img_h
                )
                x1, y1 = int(x_center - width / 2), int(y_center - height / 2)
                x2, y2 = int(x_center + width / 2), int(y_center + height / 2)
                # Draw the bounding box
                cv2.rectangle(img, (x1, y1), (x2, y2), color=(255, 0, 0), thickness=2)
                cv2.putText(img, f"Class {int(class_id)}", (x1, y1 - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1, cv2.LINE_AA)
    return img

# Get the list of image files and select the first five
image_files = [f for f in os.listdir(image_path) if f.endswith(".jpg")]
first_five_images = image_files[:5]

# Plot the first five images with labels in a single subplot
fig, axs = plt.subplots(1, 5, figsize=(20, 5))
fig.suptitle("First 5 Training Images with Annotations", fontsize=16)
for i, image_file in enumerate(first_five_images):
    img = load_image_with_labels(image_file)
    axs[i].imshow(img)
    axs[i].axis("off")
    axs[i].set_title(f"Image {i+1}")
plt.show()
from ultralytics import YOLO
import torch

# Load the pretrained YOLO11s model
model = YOLO("yolo11s.pt")

# Path to the dataset configuration file
data_yaml_path = '../dataset/ppe/data.yaml'

# Set the device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Train the model on your custom dataset
results = model.train(
    data=data_yaml_path,   # Path to the dataset YAML file
    epochs=300,            # Number of epochs to train
    imgsz=640,             # Input image size
    batch=32,              # Batch size
    device=device,         # Device for training
    lr0=0.01,              # Initial learning rate (adjust as needed)
    optimizer="SGD",       # Optimizer type (SGD or Adam)
    augment=True           # Enable data augmentation
)

# Print the results
print("Training completed. Results:", results)

YOLO11s, with 319 layers and 9,429,727 parameters, is relatively lightweight. At 6.3 GFLOPs, it's computationally efficient and suitable for faster inference on limited hardware while maintaining high accuracy.
Training:
Testing:
Training vs. Testing Performance:
AUC Curves:
Focus on Recall:
Real-world Examples:
The YOLO11s model demonstrates strong performance in detecting PPE compliance, with high precision and recall across training and testing datasets. While it generalizes well, improvements are needed for No Helmet detection, which can be addressed with targeted data augmentation and refinement of the compliance-checking method. Overall, the model is effective for real-world safety monitoring applications, ensuring critical safety violations are identified reliably.
Training Set
Testing Set
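One simple way to attack the weaker No Helmet class is to oversample training images containing it. The sketch below assumes YOLO-format label files where class 0 is No helmet (matching the class ids used later in this notebook); the function name, directory layout, and duplication factor are illustrative, not part of the actual pipeline.

```python
import shutil
from pathlib import Path

def oversample_class(image_dir, label_dir, class_id=0, factor=2):
    """Duplicate images whose YOLO label file contains `class_id`.

    Creates copies named <stem>_dupN.jpg / .txt so that the training set
    ends up with `factor` copies of each image containing the target class.
    Returns the number of image/label pairs created.
    """
    image_dir, label_dir = Path(image_dir), Path(label_dir)
    copied = 0
    for label_file in label_dir.glob("*.txt"):
        # First token of each annotation line is the class id
        classes = {line.split()[0] for line in label_file.read_text().splitlines() if line.strip()}
        if str(class_id) not in classes:
            continue
        image_file = image_dir / (label_file.stem + ".jpg")
        if not image_file.exists():
            continue
        for n in range(1, factor):
            shutil.copy(image_file, image_dir / f"{label_file.stem}_dup{n}.jpg")
            shutil.copy(label_file, label_dir / f"{label_file.stem}_dup{n}.txt")
            copied += 1
    return copied
```

Run once before training; YOLO picks up the duplicated pairs automatically since it pairs images and labels by filename.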
from ultralytics import YOLO
from IPython.display import display, clear_output
from pathlib import Path
import matplotlib.pyplot as plt

image_path_test = r'C:\Users\wwr01\OneDrive\Desktop\GRAD\MIE1517\Project\input'
model_path = r'C:\Users\wwr01\OneDrive\Desktop\GRAD\MIE1517\Project\Models\yolo11n-ppe-1111.pt'

clear_output(wait=True)
model = YOLO(model_path)  # Path to the best model weights
test_images_path = Path(image_path_test)
test_images = list(test_images_path.glob("*.jpg"))
test_images = test_images[:5]

fig, axs = plt.subplots(1, len(test_images), figsize=(20, 5))
results_list = []
image_list = []
for i, image_path in enumerate(test_images):
    image_list.append(image_path)
    results = model(image_path)
    results_list.append(results)
    annotated_img = results[0].plot()
    annotated_img_rgb = annotated_img[:, :, ::-1]
    axs[i].imshow(annotated_img_rgb)
    axs[i].axis("off")
    axs[i].set_title(f"Image {i+1}")
plt.tight_layout()
plt.show()
image 1/1 C:\Users\wwr01\OneDrive\Desktop\GRAD\MIE1517\Project\input\009a716f953692ba_jpg.rf.fc74d06060acfcb480e49a418a221ce8.jpg: 640x640 1 No helmet, 1 No vest, 1 Person, 8.0ms
Speed: 2.0ms preprocess, 8.0ms inference, 64.8ms postprocess per image at shape (1, 3, 640, 640)
image 1/1 C:\Users\wwr01\OneDrive\Desktop\GRAD\MIE1517\Project\input\07-25-2021T02034028ee52121b596_jpg.rf.eaa4ca3d7f98f1a255007dd21e0fc1af.jpg: 384x640 4 No helmets, 4 No vests, 4 Persons, 32.9ms
Speed: 2.0ms preprocess, 32.9ms inference, 1.0ms postprocess per image at shape (1, 3, 384, 640)
image 1/1 C:\Users\wwr01\OneDrive\Desktop\GRAD\MIE1517\Project\input\ppe_0038_png_jpg.rf.c34b5b86fb4d3ab072ffd3d9bd8674bb.jpg: 640x640 1 No vest, 4 Persons, 4 helmets, 3 vests, 9.0ms
Speed: 2.0ms preprocess, 9.0ms inference, 1.0ms postprocess per image at shape (1, 3, 640, 640)
image 1/1 C:\Users\wwr01\OneDrive\Desktop\GRAD\MIE1517\Project\input\ppe_0259_png_jpg.rf.81fc1f8f11c0c9c5695c6386307a8da2.jpg: 640x640 2 No vests, 3 Persons, 3 helmets, 1 vest, 8.0ms
Speed: 2.0ms preprocess, 8.0ms inference, 1.0ms postprocess per image at shape (1, 3, 640, 640)
image 1/1 C:\Users\wwr01\OneDrive\Desktop\GRAD\MIE1517\Project\input\ppe_0724_png_jpg.rf.2f52238d13f0e3a587523a60c0d6afbf.jpg: 640x640 2 Persons, 2 helmets, 2 vests, 8.0ms
Speed: 2.0ms preprocess, 8.0ms inference, 1.0ms postprocess per image at shape (1, 3, 640, 640)
The compliance check relies on a has_overlap function, which checks whether a detected person's bounding box sufficiently overlaps with the bounding box of a PPE item such as a helmet or vest.
# Sample code to check PPE compliance based on the model predictions
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

# Define classes, subject to change
NO_HELMET_CLASS = 0.0
NO_VEST_CLASS = 1.0
PERSON_CLASS = 2.0
HELMET_CLASS = 3.0
VEST_CLASS = 4.0

# Define confidence threshold
confidence_threshold = 0.8

# Function to check if a detected item overlaps with a person based on bounding box coordinates
def has_overlap(person_box, item_boxes, overlap_threshold=0.1):
    px_min, py_min, px_max, py_max = person_box[:4]
    for item_box in item_boxes:
        ix_min, iy_min, ix_max, iy_max = item_box[:4]
        # Calculate intersection coordinates
        inter_x_min = max(px_min, ix_min)
        inter_y_min = max(py_min, iy_min)
        inter_x_max = min(px_max, ix_max)
        inter_y_max = min(py_max, iy_max)
        # Calculate areas
        inter_area = max(0, inter_x_max - inter_x_min) * max(0, inter_y_max - inter_y_min)
        item_area = (ix_max - ix_min) * (iy_max - iy_min)
        # Calculate overlap ratio
        overlap_ratio = inter_area / item_area
        # Check if the overlap is above the threshold
        if overlap_ratio >= overlap_threshold:
            return True  # Sufficient overlap found
    return False  # No item met the overlap threshold

for results, image_path in zip(results_list, image_list):
    # Get detections
    detections = results[0].boxes.data.tolist()
    # Separate detections by class with confidence filtering
    people_detections = [det for det in detections if det[5] == PERSON_CLASS and det[4] >= confidence_threshold]
    helmet_detections = [det for det in detections if det[5] == HELMET_CLASS and det[4] >= confidence_threshold]
    vest_detections = [det for det in detections if det[5] == VEST_CLASS and det[4] >= confidence_threshold]
    no_helmet_detections = [det for det in detections if det[5] == NO_HELMET_CLASS]
    no_vest_detections = [det for det in detections if det[5] == NO_VEST_CLASS]

    # Check PPE compliance and prepare labels for display
    compliance_results = []
    for i, person in enumerate(people_detections, start=1):
        # Check for required and missing PPE
        has_helmet = has_overlap(person, helmet_detections)
        has_vest = has_overlap(person, vest_detections)
        no_helmet = has_overlap(person, no_helmet_detections)
        no_vest = has_overlap(person, no_vest_detections)
        # Determine compliance status
        missing_items = []
        if no_helmet:
            missing_items.append("helmet")
        if no_vest:
            missing_items.append("vest")
        if missing_items:
            compliance_status = "Non-compliant"
            compliance_label = f"Non-compliant: Missing {', '.join(missing_items)}"
        elif has_helmet and has_vest:
            compliance_status = "Compliant"
            compliance_label = "Compliant"
        else:
            if not has_helmet:
                missing_items.append("helmet")
            if not has_vest:
                missing_items.append("vest")
            compliance_status = "Non-compliant"
            compliance_label = f"Non-compliant: Missing {', '.join(missing_items)}"
        # Store the bounding box coordinates and compliance label
        compliance_results.append((person[:4], compliance_status, compliance_label))

    # Load the original image
    image = Image.open(image_path)
    # Create a figure and axis to display the image
    fig, ax = plt.subplots(1, figsize=(12, 8))
    ax.imshow(image)
    if not people_detections:
        plt.text(0.5, 0.5, "No person detected", color='red', fontsize=20, fontweight='bold',
                 ha='center', va='center', transform=ax.transAxes, backgroundcolor="white")
    else:
        # Draw bounding boxes and labels on the image
        for (bbox, compliance_status, compliance_label) in compliance_results:
            px_min, py_min, px_max, py_max = bbox
            # Draw bounding box
            rect = patches.Rectangle((px_min, py_min), px_max - px_min, py_max - py_min,
                                     linewidth=2,
                                     edgecolor='green' if compliance_status == "Compliant" else 'red',
                                     facecolor='none')
            ax.add_patch(rect)
            # Add label directly above the bounding box
            plt.text(px_min, py_min - 10, compliance_label,
                     color='green' if compliance_status == "Compliant" else 'red',
                     fontsize=10, fontweight='bold',
                     bbox=dict(facecolor="white", edgecolor="none", pad=1))
    # Display the result
    plt.axis('off')
    plt.show()
With the PPE compliance check in place, the system can evaluate whether workers are correctly wearing PPE. Building on this, a pre-trained MiDaS model is used to estimate how far each worker is from hazardous and dangerous areas; if a worker is both non-compliant and too close, the system automatically triggers an alert, providing a timely warning.
Here is sample code to predict the depth map of an image using pre-trained MiDaS.
import sys
from pathlib import Path
import numpy as np

# Add the project directory to the Python path
sys.path.insert(1, '..')
from scripts.midas_core import MidasCore

# Load the MiDaS model
midas_model_path = '../checkpoints/dpt_beit_large_512.pt'
midas = MidasCore(midas_model_path)

# Load the input images
test_images_path = Path('../dataset/Safety-6/valid/images')
test_images = list(test_images_path.glob("*.jpg"))

# Process each image to obtain depth maps and apply color mapping
depth_maps = [midas.get_depth(image) for image in test_images[:5]]
colored_depth_maps = [cv2.applyColorMap(np.uint8(depth_map), cv2.COLORMAP_VIRIDIS) for depth_map in depth_maps]

# Plot the depth maps in a single figure with 5 subplots
fig, axes = plt.subplots(1, 5, figsize=(15, 5))
for i, ax in enumerate(axes):
    # Convert BGR (OpenCV format) to RGB for Matplotlib
    ax.imshow(cv2.cvtColor(colored_depth_maps[i], cv2.COLOR_BGR2RGB))
    ax.axis("off")
plt.show()
Model loaded, number of parameters = 345M
Input resized to 512x512 before entering the encoder
The above results indicate that MiDaS is capable of understanding and predicting the spatial depth of objects and scenes from a single image input. Usually darker shades represent greater depth relative to the camera and lighter shades represent smaller depth relative to the camera.
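As a note on conventions: MiDaS predicts *relative inverse* depth, where larger values mean closer to the camera. This is why the calibration code below works with `1 - midas_prediction`. A small sketch of that normalization (the function name is ours, for illustration only):

```python
import numpy as np

def normalize_inverse_depth(inv_depth):
    """Min-max normalize a MiDaS inverse-depth map and flip it so that
    0 = nearest point and 1 = farthest point (still relative, unitless)."""
    d = (inv_depth - inv_depth.min()) / (inv_depth.max() - inv_depth.min() + 1e-8)
    return 1.0 - d
```

Only after this flip does multiplying by a calibration factor give a map where larger values mean farther away, in approximate metres.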
Hence we utilized a pre-trained MiDaS model to generate relative normalized depth maps and obtain the relative depth of each worker to the camera. The following steps summarize the process and results:
Calibration with AprilTags:
Performance Comparison: The following table compares the MiDaS estimated depths to the ground truth values:
| Name | MiDaS Estimated Depths (m) | Ground Truth Depths (m) |
|---|---|---|
| Tag 0 | 4.9 | 3.4 |
| Tag 1 | 3.6 | 2.5 |
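For reference, the error implied by the two table rows can be computed directly; the root-mean-square error works out to roughly the 1.3 m figure discussed in the results section:

```python
import numpy as np

# Values from the table above
midas_est = np.array([4.9, 3.6])      # MiDaS estimated depths (m)
ground_truth = np.array([3.4, 2.5])   # measured ground-truth depths (m)

errors = midas_est - ground_truth     # [1.5, 1.1]
rmse = float(np.sqrt(np.mean(errors ** 2)))
print(f"RMSE = {rmse:.2f} m")         # RMSE = 1.32 m
```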
Challenges in Depth Estimation:
Enhancement with YOLO Pose Estimation:
Processing Time:
This method improves the reliability of monocular depth estimation by incorporating pose-based refinement, although further optimization is needed to reduce processing time for real-time applications.
import json
import apriltag

# Ground-truth distances (m) of the calibration tags, measured on site;
# these are the ground-truth values for Tag 0 and Tag 1 in the table above.
TAG_REAL_DEPTHS = [3.4, 2.5]

def detect_apriltags(image):
    ''' Detect AprilTags in the input image and return the center coordinates.
    Args:
        image: Input image (grayscale or BGR)
    Returns:
        tag_centers: List of center coordinates of detected AprilTags
    '''
    # The detector expects a grayscale image
    if image.ndim == 3:
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Initialize the AprilTag detector
    options = apriltag.DetectorOptions(families="tag36h11")
    detector = apriltag.Detector(options)
    # Detect AprilTags in the image
    detections = detector.detect(image)
    if not detections:
        print("[Info] No AprilTags detected.")
        return []
    # List to store center coordinates
    tag_centers = []
    for detection in detections:
        # Extract the center coordinates
        center_x, center_y = detection.center
        tag_centers.append((center_x, center_y))
    return tag_centers

def depth_to_real(midas_prediction, image):
    '''
    Transfer relative MiDaS depths to real depths with known points
    and save the calibration data to a json file.
    Args:
        midas_prediction: output from MiDaS (normalized depth map)
        image: input image
    Returns:
        Real depth map, or None if calibration fails
    '''
    json_loaded = False
    # To calibrate, set calib to True
    # To use the existing calibration data, set calib to False
    calib = True
    if calib:
        # Start calibration: pair detected tag centers with their known distances
        known_points = detect_apriltags(image)
        if len(known_points) < 2:
            print("No known points detected.")
            return None
        (point1_x, point1_y), (point2_x, point2_y) = known_points[:2]
        point1_real, point2_real = TAG_REAL_DEPTHS[:2]
        point1_norm = 1 - midas_prediction[int(point1_y), int(point1_x)]
        point2_norm = 1 - midas_prediction[int(point2_y), int(point2_x)]
        if point1_norm != 0 and point2_norm != 0:
            a1 = point1_real / point1_norm
            a2 = point2_real / point2_norm
            if np.isclose(a1, a2, atol=1e-6):
                a = (a1 + a2) / 2  # Averaging to be robust
                print(f"Scaling factor 'a' is consistent. Using a = {a}")
            else:
                print("Scaling factors from the two points are inconsistent. Using Least Squares to find the best 'a'.")
                # Use Least Squares to find the best a
                D_norm = np.array([point1_norm, point2_norm])
                D_real = np.array([point1_real, point2_real])
                # Since D_real = a * D_norm, it's a simple linear fit without intercept:
                # a = (D_norm^T D_norm)^-1 D_norm^T D_real
                a = np.dot(D_norm, D_real) / np.dot(D_norm, D_norm)
                print(f"Computed scaling factor 'a' using Least Squares: {a}")
            calibration_data = {'a': a}
            with open('params/calibration_data.json', 'w') as file:
                json.dump(calibration_data, file)
            depth_map_real = a * (1 - midas_prediction)
            depth_map_real -= np.max(depth_map_real)
            depth_map_real = np.abs(depth_map_real)
            # Visualize to check whether calibration is successful
            plt.figure(figsize=(10, 8))
            plt.imshow(depth_map_real, cmap='inferno')  # Colormap that enhances depth perception
            plt.colorbar(label='Depth (meters)')
            plt.title('Depth Map Visualization')
            plt.xlabel('Pixel X')
            plt.ylabel('Pixel Y')
            plt.show()
            return depth_map_real
        return None
    else:
        if not json_loaded:
            with open('params/calibration_data.json', 'r') as file:
                calibration_data = json.load(file)
            json_loaded = True
        a = calibration_data['a']
        midas_depth_aligned = a * (1 - midas_prediction)
        midas_depth_aligned -= np.max(midas_depth_aligned)
        midas_depth_aligned = np.abs(midas_depth_aligned)
        return midas_depth_aligned
def estimate_depth(frame):
    ''' Estimate the depth of people detected in the input frame.
    This function uses the MiDaS model to estimate the depth of keypoints detected by YOLO
    and averages the depth values to estimate the depth of the person.
    Args:
        frame: Input frame
    Returns:
        depth_results: List of tuples containing bounding boxes and estimated depths
    '''
    # Get the depth map from MiDaS
    depth_map = midas.get_depth(frame, render=False)
    # Convert relative depth to real depth using known points
    real_depth_map = depth_to_real(depth_map, frame)
    if real_depth_map is None:
        print('Depth calibration failed.')
        return []
    yolo_pose_model = YOLO('checkpoints/yolo11n-pose.pt', verbose=False)
    results_pose = yolo_pose_model(frame, verbose=False)
    depth_results = []
    for result in results_pose:
        keypoints = result.keypoints.xy.cpu().numpy()
        boxes = result.boxes.xyxy.cpu().numpy()  # Bounding box coordinates (x_min, y_min, x_max, y_max)
        # Process each detection
        for keypoint, box in zip(keypoints, boxes):
            depth = midas.get_depth_keypoints_from_depth_map(real_depth_map, keypoint)
            depth_results.append((box, depth))
    return depth_results

With the ability to convert relative depth to real depth with acceptable error, we defined a real-world distance threshold to monitor each worker's proximity to hazardous areas: when a worker's depth falls below that threshold while they are not PPE-compliant, a timely warning is triggered.
For the following functions:
The plot_bboxes function is responsible for visualizing detected workers within a frame. It overlays bounding boxes, compliance labels, and depth information, and a warning icon is displayed when non-compliant workers are too close to danger zones.
The overlay_warning_image function simply overlays a warning icon onto the frame at the location of a detected worker who is too close to a danger zone.
def plot_bboxes(self, results, frame, depth_results):
    boxes = results[0].boxes.data.tolist()
    compliance_results = self.check_compliance(boxes)
    if not compliance_results:
        pass  # No person detected
    else:
        for bbox, compliance_status, compliance_label in compliance_results:
            # Match this person with a depth estimate via bounding-box overlap
            person_depth = None
            for box, depth in depth_results:
                if self.has_overlap(bbox, [box]):
                    person_depth = depth
                    break
            px_min, py_min, px_max, py_max = bbox
            if compliance_status == "Compliant":
                color = (0, 255, 0)
            else:
                color = (0, 0, 255)
            # Draw bounding box
            cv2.rectangle(frame, (int(px_min), int(py_min)), (int(px_max), int(py_max)), color, 2)
            label_text = compliance_label
            # Draw label
            label_position = (int(px_min), int(py_min) - 10)
            cv2.putText(frame, label_text, label_position, cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2, lineType=cv2.LINE_AA)
            line_spacing = 25  # Space between lines
            # Add depth information (note: check against None so a depth of 0.0 is still drawn)
            if person_depth is not None:
                depth_label = f"Depth: {person_depth:.2f}"
                cv2.putText(frame, depth_label, (label_position[0], label_position[1] + line_spacing),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2, lineType=cv2.LINE_AA)
                # Check if depth is less than threshold
                depth_threshold = 1.5  # Meters, adjust as needed
                if person_depth <= depth_threshold and compliance_status == "Non-compliant":
                    # Raise an alert
                    warning_label = "DANGER: Non-compliant person too close!"
                    cv2.putText(frame, warning_label, (label_position[0], label_position[1] + 2 * line_spacing),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2, lineType=cv2.LINE_AA)
                    self.overlay_warning_image(frame, bbox)
            else:
                depth_label = "Depth: N/A"
                cv2.putText(frame, depth_label, (label_position[0], label_position[1] + line_spacing),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2, lineType=cv2.LINE_AA)
    return frame

def overlay_warning_image(self, frame, bbox):
    # Calculate position to overlay the warning image
    px_min, py_min, px_max, py_max = map(int, bbox)
    warning_img = self.warning_image
    # Define the size of the warning image relative to the bounding box
    bbox_width = px_max - px_min
    bbox_height = py_max - py_min
    warning_img_width = int(bbox_width * 0.3)  # Change this parameter to resize
    aspect_ratio = warning_img.shape[0] / warning_img.shape[1]  # Height/Width
    warning_img_height = int(warning_img_width * aspect_ratio)
    warning_img_resized = cv2.resize(warning_img, (warning_img_width, warning_img_height), interpolation=cv2.INTER_AREA)
    # Position of the warning image (currently top left of the box)
    x_offset = px_min
    y_offset = py_min + int(bbox_height * 0.15)
    # Ensure the warning image is within frame bounds
    if x_offset + warning_img_width > frame.shape[1]:
        warning_img_width = frame.shape[1] - x_offset
        warning_img_resized = warning_img_resized[:, :warning_img_width]
    if y_offset + warning_img_height > frame.shape[0]:
        warning_img_height = frame.shape[0] - y_offset
        warning_img_resized = warning_img_resized[:warning_img_height, :]
    if warning_img_resized.shape[2] == 4:
        # Split the channels
        alpha_mask = warning_img_resized[:, :, 3] / 255.0
        alpha_inv = 1.0 - alpha_mask
        # Get the color channels of the overlay image
        overlay_color = warning_img_resized[:, :, :3]
        # Get the region of interest (ROI) from the frame
        roi = frame[y_offset:y_offset + warning_img_height, x_offset:x_offset + warning_img_width]
        # Blend the overlay with the ROI
        for c in range(0, 3):
            roi[:, :, c] = (alpha_mask * overlay_color[:, :, c] + alpha_inv * roi[:, :, c])
        # Place the blended ROI back into the frame
        frame[y_offset:y_offset + warning_img_height, x_offset:x_offset + warning_img_width] = roi
    else:
        # No alpha channel, simple overlay
        frame[y_offset:y_offset + warning_img_height, x_offset:x_offset + warning_img_width] = warning_img_resized
Results:

For detailed instructions on generating videos like the presentation demos, refer to the GitHub README.
Several projects have utilized YOLO architectures for real-time PPE detection:
Personal Protective Equipment Detection using YOLOv8:
This project employs the YOLOv8 model to detect PPE items such as helmets, masks, and safety vests in various environments, including construction sites and manufacturing facilities.
GitHub
Real-time PPE Detection based on YOLO:
This initiative introduces eight deep learning models built on the YOLO architecture for PPE detection. It also provides a high-quality dataset designed to detect individuals, vests, and helmets of different colors.
GitHub
PPE Detection using YOLOv8:
This project focuses on enhancing workplace safety by detecting PPE, including helmets, safety vests, gloves, and safety glasses, utilizing the YOLOv8 object detection algorithm.
GitHub
These projects demonstrate the effectiveness of YOLO models in accurately identifying PPE in real-time, contributing to improved safety compliance in various industries.
MiDaS has been instrumental in monocular depth estimation:
Real-time Depth Estimation using Monocular Vision (MiDaS):
This paper explores real-time monocular depth estimation using MiDaS to improve scene understanding in various applications, including robotics and augmented reality.
arXiv
Depth Estimation and 3D Point Cloud Generation Using MiDaS:
This project leverages the MiDaS model to perform depth estimation from images and generate corresponding 3D point clouds, visualized using Open3D.
GitHub
These past projects explored MiDaS's capability to provide depth estimation from single images. Collectively, they underscore the significant progress in PPE detection and depth estimation.
The YOLO11s model exhibits strong performance overall, particularly in its ability to balance computational efficiency and accuracy for PPE compliance monitoring. Achieving an AUC score of 0.965 during training and 0.842 on testing data highlights its capability to generalize effectively to unseen scenarios. The performance drop between training and testing datasets is relatively small, suggesting the model is well-trained and not overly fitted to the training data.
One unusual and interesting result is the consistent challenge in detecting the "No Helmet" category. Despite the high overall performance, this class lags behind others, with testing accuracy dropping to 0.71. This is likely due to imbalanced training data or visual ambiguities, such as occlusions or similarities with non-PPE objects like casual caps. Addressing this issue through targeted data augmentation or model refinement could significantly enhance its detection capabilities.
Another noteworthy insight is the model’s ability to handle complex real-world scenarios, such as distinguishing PPE items from visually similar objects and managing overlapping individuals. These findings underscore the model’s practicality for real-world safety monitoring, though further optimization is needed to improve computational efficiency for real-time applications.
While the depth estimation process using MiDaS and YOLO pose estimation provides additional context, one surprising result is the lack of accuracy in converting the relative depth map to a real depth map. Despite using MiDaS's largest model and calibrating with two AprilTags of known depth, the root-mean-square error remains around 1.3 meters. This result highlights limitations in the calibration process and suggests the need for improved methods to enhance depth accuracy.
Computational optimization remains a key priority to reduce the 1–1.5 seconds per frame processing time.
Catastrophic Forgetting:
This occurs when a neural network forgets previously learned information. The trained YOLO model can only detect classes that appear in the training dataset. Our initial training data only included helmet and vest classes, and to enable the system to detect the different classes of interest so that PPE compliance can be checked, we came up with a few solutions:
The method we use is to switch to a dataset with PPE and person classes, and use a self-designed, rule-based algorithm to check PPE compliance.
Capabilities of YOLOv11
As the newest version of the YOLO family, YOLO11 introduces significant improvements in architecture and training methods, making it a versatile choice for a wide range of computer vision tasks such as object detection, pose estimation, instance segmentation, and classification.
Note that, as is typical, variants with more parameters generally deliver better accuracy at the cost of speed.
MiDaS Strengths & Limitations:
One of the key learnings from this project was understanding the strengths and limitations of MiDaS for depth estimation.
The large model processes images slowly, making it unsuitable for real-time applications without significant optimization.
In the future, using a smaller pre-trained model or designing a lightweight, fine-tunable network could better balance accuracy and speed. Despite these challenges, MiDaS, combined with an effective calibration method, provided a solid foundation for our project.
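As a concrete direction, MiDaS publishes several model sizes on torch.hub, so swapping in a lighter variant is mostly a configuration change. A sketch under the assumption that the model and transform names match those listed on the intel-isl/MiDaS hub page for MiDaS 3.1 (they may differ across releases); `load_midas` is our own wrapper, not part of the MiDaS API:

```python
import torch

# Model names as published on the intel-isl/MiDaS torch.hub page (MiDaS 3.1).
MIDAS_VARIANTS = {
    "accurate": "DPT_BEiT_L_512",  # the large model used in this project (~345M params)
    "balanced": "DPT_Hybrid",
    "fast": "MiDaS_small",         # lightweight variant, closer to real time
}

# Matching input transforms exposed by the same hub repo.
MIDAS_TRANSFORMS = {
    "accurate": "beit512_transform",
    "balanced": "dpt_transform",
    "fast": "small_transform",
}

def load_midas(variant="fast"):
    """Load a MiDaS variant and its matching input transform from torch.hub."""
    model = torch.hub.load("intel-isl/MiDaS", MIDAS_VARIANTS[variant])
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    transform = getattr(transforms, MIDAS_TRANSFORMS[variant])
    model.eval()
    return model, transform
```

Since the calibration step only needs a scale factor, the same AprilTag procedure should carry over unchanged to the smaller model, letting accuracy and speed be traded off empirically.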